Systems and methods for deep model translation generation

ABSTRACT

Embodiments of the present invention relate to systems and methods for improving the training of machine learning systems to recognize certain objects within a given image by supplementing an existing sparse set of real-world training images with a comparatively dense set of realistic training images. Embodiments may create such a dense set of realistic training images by training a machine learning translator with a convolutional autoencoder to translate a dense set of synthetic images of an object into more realistic training images. Embodiments may also create a dense set of realistic training images by training a generative adversarial network (“GAN”) to create realistic training images from a combination of the existing sparse set of real-world training images and either Gaussian noise, translated images, or synthetic images. The created dense set of realistic training images may then be used to more effectively train a machine learning object recognizer to recognize a target object in a newly presented digital image.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. § 120 to U.S. patent application Ser. No. 15/705,504, entitled “Systems and Methods for Deep Model Translation Generation,” filed Sep. 15, 2017, which in turn claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application No. 62/395,841, entitled “Systems and Methods for Deep Model Translation Generation,” filed Sep. 16, 2016.

GOVERNMENT RIGHTS

This invention was made with government support under Contract No. FA8650-15-C-7552 awarded by the United States Air Force. The government has certain rights in the invention.

FIELD OF THE INVENTION

Embodiments of the present invention relate to systems and methods for improving the training of machine learning systems to recognize certain objects within an image. In this context, an “object” is typically a region of an image or a set of regions corresponding to a physical object that can be recognized within a collection of image data. Training a machine learning system to recognize objects within images is not the only application of the present invention. Although object terminology and image terminology will be used frequently herein for convenience, those of skill in the art will understand that the present invention applies to any pattern that can be recognized within a collection of data.

Embodiments of the present invention relate to systems and methods that reduce the amount of data required to train a machine learning system by means of image translation and image generation using deep learning.

Embodiments of the present invention relate to systems and methods for improving the training of machine learning systems to recognize certain objects within a given image, where the number of real-world training images is sparse, but the number of synthetically derived training images is comparatively dense.

More particularly, embodiments of the present invention relate to systems and methods for improving the training of machine learning systems to recognize certain objects within a given image, by training a machine learning system using a set of realistic training images, where the set of realistic training images is created by a process that includes translating a set of synthetic training images into more realistic training images.

Still more particularly, embodiments of the present invention relate to systems and methods for improving the training of machine learning systems to recognize certain objects within a given image, where the creation of realistic training images includes pairing certain real-world training images with certain synthetic training images and where the pairing enables a convolutional autoencoder to translate a set of input synthetic training images into a corresponding output set of more realistic training images.

Even more particularly, embodiments of the present invention relate to systems and methods for improving the training of machine learning systems to recognize certain objects within a given image, where the volume of a realistic training image data set is further enriched by a generative adversarial network (“GAN”) before being provided as training data to a machine learning system.

BACKGROUND

Machine learning is a subfield of computer science that gives computers an ability to recognize certain patterns of data without being explicitly programmed to do so. Machine learning algorithms typically operate by building a computational model to recognize patterns of data by training the model with a set of example patterns collected from real-world efforts. However, in certain machine learning applications, it can be difficult or infeasible to collect a sufficient number and/or variation of high-quality real-world training examples to adequately train a machine learning system. In these situations, machine-learning systems can be trained with synthetic computer-generated examples. Synthetic examples are often inadequate, however. Almost always, there is a significant difference between an actual real-world data example and a synthetic computer-generated example, and that difference can be important for training a machine learning system.

Object recognition, as described in this invention, is the act of recognizing given objects in an image. Part of the object recognition task is classifying objects as particular types. Therefore, “classification,” as it relates to machine learning, is described here in more detail as follows.

“Classification” is a task in the field of machine learning where unknown or unclassified objects or images of objects are grouped into collections of known or classified objects or images of objects. The known collections are called “classes.” Each class is denoted by the term c_(n), where c_(n) ⊆ C (C is the set of all classes c_(n)) and each c_(n) has a set of features ƒ_(m) where ƒ_(m) ⊆ F. Given C, there should be enough distinction between c₀, c₁, . . . c_(n) such that a set of lines exists that can divide c₀, c₁, . . . c_(n) from each other. This quality of distinction is called linear separability. Classes that are linear separable are those that can be separated by straight lines. Classes that are not linear separable are those that cannot be separated by straight lines; instead, such classes in C can only be separated by non-linear (e.g., curved) boundaries. FIG. 17 and FIG. 18 illustrate the difference between classes that are linear separable (FIG. 17) and classes that are not linear separable (FIG. 18). Both FIG. 17 and FIG. 18 are further described below.
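
For illustration only (not part of the described embodiments), the following minimal Python sketch contrasts linear separable and non-separable classes by fitting a linear classifier to two toy data sets; the use of scikit-learn and the specific data generators are assumptions made for this example.

    # Illustrative sketch only: fit a linear classifier to a linearly separable
    # toy data set (cf. FIG. 17) and to a non-separable one (cf. FIG. 18) and
    # compare training accuracy. Library and generator choices are assumptions.
    from sklearn.datasets import make_blobs, make_circles
    from sklearn.svm import LinearSVC

    # Separable case: three well-separated clusters, dividable by straight lines.
    X_sep, y_sep = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
    sep_acc = LinearSVC(max_iter=10000).fit(X_sep, y_sep).score(X_sep, y_sep)

    # Non-separable case: concentric rings that no straight line can divide.
    X_non, y_non = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)
    non_acc = LinearSVC(max_iter=10000).fit(X_non, y_non).score(X_non, y_non)

    print(f"linear classifier accuracy, separable classes:     {sep_acc:.2f}")
    print(f"linear classifier accuracy, non-separable classes: {non_acc:.2f}")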

A common way to identify or classify data is to use a machine learning technique. In machine learning, computers are used to classify, predict, or infer new facts based on an internal representation. This representation is quite often based on training a machine learning algorithm using existing data that bears some similarity to the data that is unknown.

Training a classifier (e.g., a machine learning system that will classify or recognize objects) entails defining an appropriate machine learning algorithm, defining that algorithm's parameters, and establishing a data set that will best represent the space of objects to be classified. For example, if a goal is to classify types of pets, the training data should contain a sufficient number of examples of pets that will need to be classified. A classifier that is missing examples of fish but has examples of dogs and cats may not be able to sufficiently classify fish.

Given a data set of images to be used for a given task, for example the task of classification, the quality of the classification is a function of the quantity of each type of object in the image set used to train the image classifier.

The quality of the data is also important. For example, if the goal of the classifier is to classify images of pets, and the set of images contains image samples for a particular type that are not clean, then the classifier may not accurately classify unseen pets of that type. An unclean image might include characteristics such as the following:

Noisy Backgrounds—Lots of other information in the image that clutters the image;

Obfuscation—The actual object is obfuscated in some way (e.g., it is in a shadow);

Distortion—The actual object is distorted in some way; or

Bad Perspective—The perspective of the actual object is not consistent with samples to be classified.

Therefore, given a data set of images to be used for a given task, for example the task of classification, the quality of the classification is a function of both the quantity of each type of object and the overall quality of the image set used to train the image classifier.

Given a data set that is not sufficiently sized or that contains training samples that underrepresent the actual objects that need to be classified, different measures may be taken to overcome these problems:

1. De-noise or declutter the image.

2. Extract just the main object from the image and create new images containing just that object, for example, a dog.

3. Create duplicates of images in the training data set that are considered ‘good’ representatives of the types of objects that will be classified.

4. Take these ‘good’ representatives and change them just enough to make them different from the original.

5. Use data from other data sources to supplement the training data set.

6. Take images from a different data set that are perhaps similar in some way and make them look like the images in the training data set.

Exploring the use of ‘good’ candidate images and creating duplicates that are transformed enough to call them different can produce transformed images that can be used to supplement the data set. Transformations can include applying simple mathematical operations to the images, histogram modifications, interpolation, rotations, background removal, and more.
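
As a purely illustrative aside, the following Python sketch shows the kind of simple transformations described above (rotation, mirroring, histogram modification, interpolation) applied to a single 'good' training image; the use of Pillow and the hypothetical file name are assumptions made for this example.

    # Illustrative sketch only: conventional augmentation of one training image
    # by simple transformations. Pillow usage and the file name are assumptions.
    from PIL import Image, ImageOps

    def simple_augmentations(image):
        """Return a few transformed copies of one training image."""
        rgb = image.convert("RGB")
        return [
            rgb.rotate(15, expand=True),             # small rotation
            rgb.rotate(-15, expand=True),
            ImageOps.mirror(rgb),                    # horizontal flip
            ImageOps.equalize(rgb),                  # histogram modification
            rgb.resize((rgb.width // 2, rgb.height // 2)).resize(rgb.size),  # interpolation
        ]

    # Hypothetical usage:
    # augmented = simple_augmentations(Image.open("dog_0001.png"))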

Such transformations can improve classification under certain circumstances, but performance gains are often minimal.

More importantly, acquiring additional data typically implies more cost, as data acquisition and data preparation can be costly. Synthetic approaches and data transformations also result in higher cost with usually a lower payoff, as these methods are inferior to the true data samples. Adapting other data sets to mimic the data required to train the machine learning method again implies high costs, as this process involves some mechanism to create the synthetic examples.

A key ingredient for deep learning is the data. Deep learning algorithms tend to require more data than the average machine learning algorithm in order to learn data representations and features.

Machine Translation

Machine translation (“MT”) [see endnote 2] is part of Computational Linguistics, a subset of Natural Language Processing, and it implies a machine is assisting with the translation of either text or speech from one language to another. MT can involve simple word substitution and phrase substitution. Typically, statistical methods are used to perform MT. In certain embodiments, we apply machine translation to images by means of deep model translation.

Autoencoders

An autoencoder [see endnote 3] is an unsupervised neural network that closely resembles a feedforward non-recurrent neural network with its output layer having the same number of nodes as the input layer. Within the autoencoder, the dimensionality of the data is reduced to a size much smaller than the original dimensions. This reduction is often called a “bottleneck.” The encoder flattens or compresses the data into this smaller bottleneck representation. The decoder then tries to recreate the original input from this compressed representation, producing a representation that is equal to the size of the input and similar to the original input. The better the performance of the autoencoder, the closer the recreated output is to the original input.

Formally, within an autoencoder, a function maps input data to a hidden representation using a non-linear activation function. This is known as the encoding:

$z = f(x) = s_f(Wx + b_z),$

where the function f maps input x to a hidden representation z, s_f is a non-linear activation function, and W and b represent the weights and bias.

A second function may be used to map the hidden representation to a reconstruction of the expected output. This is known as the decoding:

$y = g(z) = s_g(W'z + b_y),$

where g maps hidden representation z to a reconstruction y, s_g is a non-linear activation function, and W' and b represent the weights and bias.

In order for the network to improve over time, it minimizes an objective function based on the negative log-likelihood:

$AE(\theta) = \sum_{x \in D_n} L(x, g(f(x))),$

where L is the negative log-likelihood and x is the input.

There are different variants of autoencoders: from fully connected to convolutional. With fully connected autoencoders, neurons contained in a particular layer are connected to each neuron in the previous layer. (A “neuron” in an artificial neural network is a mathematical approximation of a biological neuron. It receives a vector of inputs, performs a transformation on them, and outputs a single scalar value.) With convolutional layers, the connectivity of neurons is localized to a few nearby neurons in the previous layer. For image-based tasks, convolutional autoencoders are the standard. In embodiments of this invention, when autoencoders are referenced, it is implied that the convolutional variant may be used.
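
The following is a minimal sketch of a convolutional autoencoder of the kind described above, assuming 64x64 RGB inputs and the PyTorch library; the layer sizes and training step are illustrative assumptions and not the specific architecture used in the embodiments.

    # Minimal sketch of a convolutional autoencoder; sizes are assumptions.
    import torch
    import torch.nn as nn

    class ConvAutoencoder(nn.Module):
        def __init__(self):
            super().__init__()
            # Encoder f: compresses the input toward a lower-dimensional bottleneck.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, 4, stride=2, padding=1),   # 64 -> 32
                nn.ReLU(),
                nn.Conv2d(16, 32, 4, stride=2, padding=1),  # 32 -> 16
                nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 16 -> 8
                nn.ReLU(),
            )
            # Decoder g: reconstructs an output the same size as the input.
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),  # 8 -> 16
                nn.ReLU(),
                nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 16 -> 32
                nn.ReLU(),
                nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),   # 32 -> 64
                nn.Sigmoid(),
            )

        def forward(self, x):
            z = self.encoder(x)        # z = f(x)
            return self.decoder(z)     # y = g(z)

    # Hypothetical training step minimizing a reconstruction loss L(x, g(f(x))):
    model = ConvAutoencoder()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x = torch.rand(8, 3, 64, 64)                  # stand-in batch of images
    loss = nn.functional.mse_loss(model(x), x)    # reconstruction objective
    loss.backward()
    optimizer.step()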

Generative Adversarial Networks (“GANS”)

A generative adversarial network (“GAN”) [see endnote 1] is a network made of two deep networks. The two networks can be fully connected, where each neuron in layer l is connected to every neuron in layer l−1, or can include convolutional layers, where each neuron in layer l is connected to a few neurons in layer l−1. The GANs used in embodiments of the invention encompass a combination of fully connected layers and convolutional layers. One of the networks is typically called the discriminative network and the other is typically called the generative network. The discriminative network has knowledge of the training examples. The generative network does not, and tries to ‘generate new samples,’ typically beginning from noise. The generated samples are fed to the discriminative network for evaluation. The discriminative network provides an error measure to the generative network to convey how ‘good’ or ‘bad’ the generated samples are, as they relate to the data distribution generated from the training set.

Formally, a generative adversarial network defines a model G and a model D. Model D distinguishes between samples from G and samples h from its own distribution. Model G takes random noise, defined by z, as input and produces a sample ĥ. The input received by D can be from ĥ or h. Model D produces a probability indicating whether the sample fits into the distribution or not.

Variations of the following objective function are used to train both types of networks:

$\min_G \max_D \; \mathbb{E}_{h \sim p_{Data}(h)}\left[\log D(h)\right] + \mathbb{E}_{z \sim p_{Noise}(z)}\left[\log\left(1 - D(G(z))\right)\right]$
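
A minimal training-loop sketch of this adversarial objective is given below, assuming PyTorch and small fully connected networks over flattened 64x64 grayscale images; the architectures, batch size, and learning rates are illustrative assumptions only.

    # Sketch of alternating discriminator/generator updates; all sizes are assumptions.
    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Tanh())
    D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    real = torch.rand(16, 64 * 64)        # stand-in for real training samples h
    for _ in range(100):                  # illustrative number of iterations
        # Discriminator step: push D(h) toward 1 and D(G(z)) toward 0.
        z = torch.randn(16, 100)
        fake = G(z).detach()
        d_loss = bce(D(real), torch.ones(16, 1)) + bce(D(fake), torch.zeros(16, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: push D(G(z)) toward 1 (non-saturating form of the objective).
        z = torch.randn(16, 100)
        g_loss = bce(D(G(z)), torch.ones(16, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()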

SUMMARY OF THE INVENTION

Embodiments of the invention provide a way to reduce the cost of data acquisition for image recognition tasks by improving training through an augmentation process that involves supplementing and/or augmenting training image data sets with automatically generated synthetic images using two distinct and complementary methods: translation and generation. Both of these methods are incorporated into embodiments of the invention that we call Deep Image Model Translation Generation (“DMTG”).

First, by translating images that are statistically different, embodiments of the invention can use image translation to improve image recognition tasks by using translated images to supplement the target training data set, thus converting a sparse training data set into a comparatively dense training data set. FIG. 12, described in more detail below, illustrates the interactions between image translation and object recognition.

Image translation applies the theory of machine translation to translate images of one data set (input data set) into images of another data set (target data set). This technique can be used to translate images of lower quality or lower resolution into images of a higher quality or higher resolution. It can also be used to translate noisy images with clutter to images without clutter. It is a revolutionary way to improve the overall quality and quantity of the training data set to be used for classification.

Second, embodiments of the invention can significantly increase the size of a training data set by generating images that are similar to each image in the training data set. FIG. 13, described in more detail below, illustrates the interactions between image generation and object recognition.

Typical image generation methods used to supplement training data sets include duplication and transformation. These methods offer minimal performance gain.

Image generation embodiments assume, given some point in space, i.e., an image, other images exist that are unknown, or not seen, which are variations of that same image. These embodiments find other images surrounding a ‘known’ image and generate images around that known image to supplement the image data set. The result is that tens of thousands of images based on each ‘known’ image can be created, hence providing a truly innovative way to supplement the training data set.

Image translation and generation work together to further improve classification, and hence object recognition. FIG. 14, described in more detail below, illustrates how image translation and generation can be used to support object recognition tasks.

Translated images or existing images contained in the training data set can be used to improve the quality of the generated images by initiating the generative process with this known information.

In addition, generated images have the potential to be used as feedback into the translation process to move the translated images closer to the true images of the original training data distribution.

Together these methods are able to convert images that are inferior in some way to the target image data set into images that are similar enough to the target image data set that they could be used to supplement the original training image data set, thus enabling a resulting object recognition method to reach higher performance levels with significantly less real-world training data.

The acquisition of large image data sets can be costly. The time to acquire and preprocess real-world data in order for the data to be suitable for an image recognition task involves many hours of work. The embodiments described herein can reduce the amount of real-world data required for object recognition without compromising performance.

The above summaries of embodiments of the present invention have been provided to introduce certain concepts that are further described below in the Detailed Description. The summarized embodiments are not necessarily representative of the claimed subject matter, nor do they limit or span the scope of features described in more detail below. They simply serve as an introduction to the subject matter of the various inventions.

BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above recited summary features of the present invention can be understood in detail, a more particular description of the invention may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.

FIG. 1 is a block diagram illustrating how an object recognizer given a well-balanced, sufficiently sized training data set may produce accuracy rates that increase as the number of training examples increases, when recognizing objects in an unknown set of images.

FIG. 2 is a block diagram illustrating how an object recognizer given a sparse, unbalanced training data set may produce accuracy rates that are lower for most images of an unknown set of images, even as the number of training examples increases, particularly for objects corresponding to types of classes that were under-represented in the training data set.

FIG. 3 is a block diagram illustrating how an object recognizer given missing classes, sparsity, and unbalanced training data may produce accuracy rates that are low for the majority of images recognized, given an unknown set of images. Even as the number of training examples increases, accuracy rates tend to remain low.

FIG. 4 is a block diagram illustrating how an object recognizer's performance can be improved by using image translation in addition to the original training data set.

FIG. 5 is a block diagram illustrating how an object recognizer's performance can be improved by using image generation in addition to the original training data set.

FIG. 6 is a block diagram illustrating an embodiment comprising training a DMTG translator.

FIG. 7 is a block diagram illustrating an embodiment comprising training a DMTG generator using noise as an initialization.

FIG. 8 is a block diagram illustrating an embodiment comprising training a DMTG generator using translated images as an initialization.

FIG. 9 is a block diagram illustrating an embodiment comprising training a DMTG generator using synthetic images as an initialization.

FIG. 10 is a block diagram illustrating an embodiment comprising training a DMTG translator using activations from an object recognizer.

FIG. 11 is a block diagram of an embodiment illustrating the process of recognizing objects using the object recognizer.

FIG. 12 is a block diagram of an embodiment illustrating the process of training an object recognizer using translated images in addition to training images.

FIG. 13 is a block diagram of an embodiment illustrating the process of training an object recognizer using generated images in addition to training images.

FIG. 14 is a block diagram of an embodiment illustrating the process of training an object recognizer using translated images and generated images in addition to training images.

FIG. 15 illustrates an example result of a translation process from a synthetic image of a cat to a photo realistic image of a cat, according to an embodiment of the invention.

FIG. 16 illustrates an example translation process where a set of images on one manifold are translated into a set of images which exist on a different manifold, according to an embodiment of the invention.

FIG. 17 illustrates linear separable objects in a space, where the objects have been grouped into three different classes.

FIG. 18 illustrates linear inseparable objects in a space, where the objects have been grouped into three different classes.

FIG. 19 illustrates a translation process where one domain of images is translated into another domain of images, according to an embodiment of the invention.

FIG. 20 illustrates a translation process where one domain of images is translated into another domain of images, according to an embodiment of the invention.

FIG. 21 illustrates an image generation process using a Generative Adversarial Network (GAN), according to an embodiment of the invention.

FIG. 22 is a block diagram illustrating the image generation process using multiple Generative Adversarial Networks (GANs), according to an embodiment of the invention.

FIG. 23 is a block diagram illustrating a variation of training a translator that includes unpaired images, according to an embodiment of the invention.

FIG. 24 shows an even further advancement by enriching the translation process with generated images, according to an embodiment of the invention.

FIG. 25 shows the effects of image generation on a set of 3 classes of images, according to an embodiment of the invention.

FIG. 26 is a block diagram of an exemplary embodiment of a Computing Device in accordance with the present invention.

DESCRIPTION OF THE EMBODIMENTS

Embodiments of the present invention will be described with reference to the accompanying drawings, wherein like parts are designated by like reference numerals throughout, and wherein the leftmost digit of each reference number refers to the drawing number of the figure in which the referenced part first appears.

Overview

One goal of Deep Image Model Translation Generation (“DMTG”) is to improve the training of a machine learning system by generating a large volume of synthetic training images using deep image translation and/or deep image generation techniques, where the process of obtaining training images from real-world sources would be difficult and/or costly. Embodiments of the invention address three main problems: (1) lack of usable data to sufficiently train an object recognizer, thereby producing lower accuracies when using the object recognizer to detect objects; (2) certain classes of objects may not be sufficiently represented in the training data set, which may produce an object recognizer that cannot sufficiently recognize all objects that it was intended to recognize; and (3) poor quality of training data may produce a poorly trained object recognizer that results in lower accuracy rates.

To sufficiently recognize objects in an image, the various classes of objects to be recognized must be sufficiently represented in the training data set. Large image-based training data sets are often hard to acquire. Even under the best of circumstances, a significant amount of effort must be exerted to prepare data sets such that they can be used to train a machine learning system. If relevant training data is not available, then synthetic data may be adapted and substituted for the real data. Data that is labeled, i.e., where the objects are identified as belonging to a specific category, providing the machine learning system with ‘ground truth’, can also present an acquisition challenge. Without a sufficiently sized training data set, performance can be hindered when applying the trained machine learning system to unseen or unknown images for object identification.

In other cases, training data sets may be considered unbalanced or biased towards certain classes. Hence, object recognition performance will be biased more toward the classes that are more sufficiently represented than the classes that are less sufficiently represented.

FIG. 1 is a block diagram illustrating how an object recognizer 103 that is given a well-balanced, sufficiently sized training data set 101 can produce a plot of accuracy rates 105 (object recognition performance) that increase as the number of training examples 109 increases, when recognizing objects in an unknown set of images. Points 109 show lower accuracy given a smaller number of training examples. As the number of examples increases 107, the accuracy will increase because the number of classes will be more sufficiently represented. For deep learning problems, this is of particular importance, as deep learning algorithms tend to require significantly more training data than a typical supervised machine-learning algorithm.

FIG. 2 shows how accuracy may be hindered by sparse or unbalanced training data. FIG. 2 is a block diagram illustrating how an object recognizer 203 given a sparse, unbalanced training data set 201 can produce a plot of accuracy rates 205 that are lower for most images of an unknown set of images, even as the number of training examples increases, particularly for objects corresponding to types of classes that were under-represented in the training data set. Given a training data set 201 that is sparse and/or unbalanced, the performance 205 of the object recognizer 203 is likely not to reach the accuracy levels reached in FIG. 1 even as the number of examples increases. Hence, the majority of images to be classified will be incorrectly classified 209 and very few will be accurately classified 207. Points 209 represent lower recognition accuracy due to an under-representation of class examples. Points 207 show that even as the number of training examples increases, the accuracy will remain low for under-represented classes.

Certain classes of data may not be represented at all. FIG. 3 is a block diagram illustrating how an object recognizer 303 given missing classes, sparsity, and unbalanced training data 301 may produce accuracy rates 305 that are low for the majority of images recognized, given an unknown set of images. As shown in FIG. 3, the lack of certain classes can be worse than unbalanced training data, as in this situation the training data set 301 will have certain classes that are not represented at all. Hence, the object recognizer 303 will fail to recognize those classes. Even as the number of training examples increases, accuracy rates tend to remain low, because the majority of unrepresented classes are not likely to be accurately classified. Points 307 show that this is the most unsuitable type of data set, as classes may have zero or very few training examples.

In addition to missing or inadequately represented classes, the quality of the training data could be problematic. If images are of a poor quality, this could affect the performance of the object recognition method. Poor quality data could mean the objects are blurry, obfuscated in some way, or hard to identify due to the perspective of the object in the image. In addition, there could be significant noise in the image or other objects that affect the overall quality.

The best way to overcome issues such as these is to require less training data but achieve the same (or better) object recognition performance.

Embodiments of the present invention provide two methods that achieve the same object recognition performance while requiring less training data, thereby overcoming the issues often found among training data.

The first method described—Deep Model Image Translation—enables translation from one object to another. The translation can be applied to a set of images from one domain to be translated to a different domain, as shown in FIG. 15 where drawings of cats 1501 can be translated 1503 into photo realistic images of cats 1505. The translation can also be applied to a set of images of a particular class translated to a set of images of another class, as shown in FIG. 16 where 1601 represents the domain to translate from, 1603 is the translation process, and 1605 represents the domain to be translated to. This in particular addresses data set sparsity, unbalanced class representation, and data set adaptation.

The second method described—Deep Model Image Generation—enables the generation of images similar to a given image. The generated images are not the original images with some global transformation applied. Rather, the generated images are new images that were not previously represented. FIG. 25 depicts a data set with a sparsely defined class 2503 and examples of ‘generated’ images (2509, 2511, 2513, 2515, 2517, 2519). This embodiment takes advantage of the theory of image manifolds [see endnote 5] and generates images that are similar to a given image with slight variations.

Both of the aforementioned methods—Deep Model Image Translation and Deep Model Image Generation—can be used to improve the object recognition process. In FIG. 11, object recognition in its basic form is shown. Given an image 1103, the object recognizer 1105, which is a deep neural network, tries to identify an object 1109 in image 1103 based on the classes 1107 it has been trained to recognize.

In one embodiment, as shown in FIG. 4, a translator 403 can be used to generate a set of translated images 413. When an object recognizer uses real-world training images and translated images together, higher accuracy rates 419 can be achieved with fewer samples of the original data set 417. This method is particularly suitable for methods that require very large amounts of data, for example deep learning neural networks (a subset of machine learning), but the technique could be applied to any machine learning problem. The basic process flow for training the translator, as shown in FIG. 6, is for a set of synthetic images 603 to be paired 605 with training images 601. The translator 609 is then trained to produce a network that can then be called upon with any unpaired synthetic image to produce a translated version of the image. As depicted in FIG. 12, an object recognizer 1219 can be trained using the translated images in addition to a set of training images. In FIG. 6, two different data sources are shown, but as mentioned they can originate from the same data source. Also note the training images 601 in FIG. 6 may not be the same set of training images 1213 in FIG. 12.

As depicted in FIG. 19, a translation process can have two phases. First, an encoder can encode an image into a hidden representation. In the hidden representation, the dimensionality of the encoded image is reduced, which enables certain features of the image to be identified. Then, the encoded representation is decoded into a complete image, completing the translation. This translation process can be performed on an entire set of training images, which enables the network to learn the translation. This training process is based on the pairing mechanisms illustrated in FIG. 6. By using an image pairing technique 605 during training, whereby images from two domains (603, 601) are paired, a translator 609 can be trained to encode the images from one domain into images from another domain. For example, as shown in FIG. 15, an autoencoder might be trained to encode poor quality images of cats (e.g., 1501) into photo realistic images of cats (e.g., 1505).

Deep Model Image Translation

FIG. 4 is a block diagram illustrating how an object recognizer's performance can be improved by using image translation in addition to the original training data set. DMTG 401 includes a translator 403 comprising an autoencoder 405, which includes an encoder 407 and decoder 409. The encoder 407 compresses the data, whereby N dimensions are mapped to M dimensions and M<N. The decoder 409 uncompresses the data, mapping the M dimensions to N dimensions using a multi-layered convolutional neural network. The original training images 411 and translated images 413 created by 401 are used to train the object recognizer 415. The results of performing object recognition on an unknown data set are shown on graph 417, which plots the objects recognized given the number of training examples and the accuracy. Points 419 show that the object recognizer 415 is able to achieve high accuracy rates with few real-world (original) training examples because the object recognizer 415 was trained with supplemental translated images 413.

FIG. 5 is a block diagram illustrating how an object recognizer's performance can be improved by using image generation in addition to the original training data set. DMTG 501 includes a generator 503 comprising a GAN 505, which includes a discriminative neural network 507 and a generative neural network 509. The original training images 511 and generated images 513 created by 501 are used to train the object recognizer 515. The results of performing object recognition on an unknown data set are shown on graph 517, which plots the objects recognized given the number of training examples and the accuracy. Points 519 on graph 517 show that the object recognizer 515 is able to achieve high accuracy rates with few training examples because the object recognizer 515 was trained with the supplemental generated images 513.

FIG. 6 is a block diagram illustrating an embodiment comprising training a DMTG translator. Given a relatively sparse set of training images 601 and a set of separately created synthetic images 603, a set of image pairings 605 indicates which synthetic images 603 align with corresponding training images 601. The pairings 605 are input to the translator 609, which is part of DMTG 607. The translator 609 comprises an autoencoder 611, which includes an encoder 613 and decoder 615. Encoder 613 compresses the data, whereby N dimensions are mapped to M dimensions and M<N. Decoder 615 uncompresses the data, mapping the M dimensions to N dimensions using a multi-layered convolutional neural network. Each time an image from synthetic images 603 goes through the encoder 613 and decoder 615, the network in translator 609 learns how to take the compressed information based on a synthetic image 603 to generate a training image such as the training images in 601, based on pairings 605. Once trained, DMTG 607 can be called upon with any unpaired synthetic image to produce a translated version of the image. The output of the translator 609 is a comparatively dense set of translated images 617.
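
A minimal sketch of this paired training procedure is shown below, assuming PyTorch and a toy one-layer encoder/decoder; the key point is that the loss compares the decoded output of a synthetic image with its paired real training image rather than with the input itself. The tensors standing in for images 601 and 603 are placeholders, not actual data.

    # Sketch of training a translator on paired (synthetic, real) images.
    import torch
    import torch.nn as nn

    translator = nn.Sequential(                                        # encoder + decoder
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),           # encode (compress)
        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid()  # decode (uncompress)
    )
    optimizer = torch.optim.Adam(translator.parameters(), lr=1e-3)

    # Stand-ins for paired images; in practice these would come from pairings 605.
    synthetic_batch = torch.rand(8, 3, 64, 64)   # placeholders for images from 603
    real_batch = torch.rand(8, 3, 64, 64)        # placeholders for paired images from 601

    for _ in range(10):                          # illustrative training iterations
        translated = translator(synthetic_batch)
        loss = nn.functional.mse_loss(translated, real_batch)  # match the paired real image
        optimizer.zero_grad(); loss.backward(); optimizer.step()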

Deep Model Generation

FIG. 7 is a block diagram illustrating an embodiment comprising training a DMTG generator using noise as an initialization. The DMTG 707 includes a generator 709, which is composed of a GAN 711 comprising a discriminator neural network 713 and a generative neural network 715. Given a relatively sparse set of training images 701 as input into the discriminator neural network 713, and Gaussian noise 705, defined by the mathematical equation 703, as input into the generative neural network 715, the two networks in the GAN 711 work together to train the generator neural network 715 to produce images that are representative of the training examples 701 used to train the discriminative network 713. The discriminative neural network 713 is aware of the original training images 701. The generative neural network 715 is not given data to begin with; instead it is initialized with Gaussian noise 705. The two networks play an adversarial game where the generator 715 generates images 717 that are then processed by the discriminator 713. The discriminator 713 classifies the generated images 717 as either similar to the training data set 701 or not. The generator 715 may use this information to learn how to generate new images 717 that are better than the last iteration of generated images. The output from the discriminator 713 may be used directly in the loss function for the generator 715. The loss function changes how the generator 715 will generate the next set of images. Once trained, DMTG 707 can be called upon to produce generated images. The output is a comparatively dense set of generated images 717.

FIG. 8 is a block diagram illustrating an embodiment comprising training a DMTG generator using translated images as an initialization. The DMTG 807 includes a generator 809, which is composed of a GAN 811 comprising a discriminator neural network 813 and a generative neural network 815. Given a relatively sparse set of training images 801 as input into the discriminator network 813 and translated images 803 as input into the generative network 815 (note this input 805 is different than Gaussian noise), these two networks in the GAN 811 work together to train the generator neural network 815 to produce images that are representative of the training examples 801 used to train the discriminative network 813. The discriminative neural network 813 is aware of the original training images 801. The generative neural network 815 is not given data to begin with; instead it is initialized with translated images 803. The two networks play an adversarial game where the generator 815 generates images 817 that are then processed by the discriminator 813. The discriminator 813 then classifies the generated images 817 as either similar to the training data set 801 or not. The generator 815 may use this information to learn how to generate new images 817 that are better than the last iteration of generated images. The output from the discriminator 813 may be used directly in the loss function for the generator 815. The loss function changes how the generator 815 will generate the next set of images. Once trained, DMTG 807 can be called upon to produce generated images. The output is a comparatively dense set of generated images 817.

FIG. 9 is a block diagram illustrating an embodiment comprising training a DMTG generator using synthetic images as an initialization. The DMTG 907 includes a generator 909, which is composed of a GAN 911 comprising a discriminator neural network 913 and a generative neural network 915. Given a relatively sparse set of training images 901 as input into the discriminator network 913 and a set of separately created synthetic images 903 as input into the generative network 915 (note this input 905 is different than Gaussian noise), these two networks work together to train the generator network 915 to produce images that are representative of the training examples 901 used to train the discriminative network 913. The discriminative neural network 913 is aware of the original training images 901. The generative neural network 915 is not given data to begin with; instead it is initialized with synthetic images 903. The two networks play an adversarial game where the generator 915 generates images 917 that are then processed by the discriminator 913. The discriminator 913 classifies the generated images 917 as either similar to the training data set 901 or not. The generator 915 may use this information to learn how to generate new images 917 that are better than the last iteration of generated images. The output from the discriminator 913 may be used directly in the loss function for the generator 915. The loss function changes how the generator 915 will generate the next set of images. Once trained, DMTG 907 can be called upon to produce generated images. The output is a comparatively dense set of generated images 917.
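
The following sketch illustrates the image-initialized variant of FIGS. 8 and 9, assuming PyTorch and toy convolutional networks: the generative network receives translated or synthetic images rather than Gaussian noise, while the discriminator is still trained against the sparse real training images. All architectures and hyperparameters are illustrative assumptions, not the specific networks of the embodiments.

    # Sketch of an image-conditioned GAN: the generator refines seed images
    # (translated or synthetic) instead of mapping noise vectors.
    import torch
    import torch.nn as nn

    G = nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
    )
    D = nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Flatten(), nn.Linear(32 * 32 * 32, 1), nn.Sigmoid(),
    )
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    real = torch.rand(8, 3, 64, 64)         # stand-in for sparse real training images
    seed_images = torch.rand(8, 3, 64, 64)  # stand-in for translated or synthetic images

    for _ in range(10):
        # Discriminator step: real images vs. generator output from the seed images.
        fake = G(seed_images)
        d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: the discriminator's output drives the generator's loss.
        g_loss = bce(D(G(seed_images)), torch.ones(8, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()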

Deep Model Image Translation with Activations

FIG. 10 is a block diagram illustrating an embodiment comprising training a DMTG translator using activations from an object recognizer. Given a relatively sparse set of training images 1001 and a set of separately created synthetic images 1003, image pairs (or pairings) 1005 indicate which synthetic images 1003 align with which training images 1001. The pairs 1005 are the input to a translator 1009, which is a component of DMTG 1007. The translator 1009 comprises an autoencoder 1011, which consists of an encoder 1013 and decoder 1015. The encoder 1013 compresses training image data, whereby N dimensions are mapped to M dimensions and M<N. The decoder 1015 uncompresses the data, mapping the M dimensions to N dimensions using a multi-layered convolutional neural network. Each time an image from synthetic images 1003 goes through encoder 1013 and decoder 1015, the network within the autoencoder 1011 learns how to take the compressed information based on 1003 to generate the image from 1001 based on pairing 1005. The output of the translator 1009 is a comparatively dense set of translated images 1019. As the object recognizer 1017 trains its own deep network, activations 1021 (the internal representation of the input) from the object recognizer 1017 are fed back into the autoencoder 1011 and used to further optimize the autoencoder.
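
A minimal sketch of this activation feedback is given below, assuming PyTorch: intermediate activations from a stand-in for the object recognizer are compared for the translated output and the paired real image, and the resulting term is added to the autoencoder's reconstruction loss. The toy networks and the 0.1 weighting are illustrative assumptions only.

    # Sketch of adding an activation-matching term to the translator's loss.
    import torch
    import torch.nn as nn

    translator = nn.Sequential(
        nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
    )
    recognizer_features = nn.Sequential(       # stand-in for early recognizer layers (cf. 1017)
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    )
    optimizer = torch.optim.Adam(translator.parameters(), lr=1e-3)

    synthetic_batch = torch.rand(8, 3, 64, 64)  # placeholders for paired synthetic images
    real_batch = torch.rand(8, 3, 64, 64)       # placeholders for paired real images

    for _ in range(10):
        translated = translator(synthetic_batch)
        pixel_loss = nn.functional.mse_loss(translated, real_batch)
        # Activation term: compare recognizer activations (cf. 1021) of both images.
        act_loss = nn.functional.mse_loss(
            recognizer_features(translated), recognizer_features(real_batch).detach()
        )
        loss = pixel_loss + 0.1 * act_loss      # weighting chosen arbitrarily for illustration
        optimizer.zero_grad(); loss.backward(); optimizer.step()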

Using Translation and Generation to Improve Object Recognition

FIG. 11 is a block diagram of an embodiment illustrating the process of recognizing objects using an object recognizer 1105 executing on a computing device 1111 such as Computing Device 2600 (see FIG. 26). Given a new image 1103 obtained from an optional external image sensor 1101, where the new image 1103 comprises an object 1109 (for example, the cat shown on the right side of image 1103), object recognizer 1105, once trained with a sufficiently dense set of training images 1102, will be able to recognize object 1109 in the image 1103 and produce an object label 1107 corresponding to the class of the object 1109 (for example, in FIG. 11, the class could be “cat”). Training images 1102 could comprise translated images 617, generated images 717, generated images 817, generated images 917, translated images 1019, translated images 1217, training images 1213, generated images 1317, training images 1313, translated images 1427, generated images 1425, training images 1423, training images 2147, training images 2249, translated images 2317, and/or translated images 2427. Object recognizer 1105 could be the same object recognizer as object recognizer 103, object recognizer 203, object recognizer 303, object recognizer 415, object recognizer 515, object recognizer 1017, object recognizer 1219, object recognizer 1319, or object recognizer 1429.

FIG. 12 is a block diagram of an embodiment illustrating the process of training an object recognizer using translated images in addition to training images. During training of the object recognizer 1219, a relatively sparse set of training images 1213 and a relatively dense set of translated images 1215 are used. The translated images 1215 are produced by the DMTG 1203, and specifically the translator 1205, after it has been trained according to methods described herein. The translator 1205 includes an autoencoder 1207, which may contain an encoder 1209 and a decoder 1211. The encoder 1209 compresses the data, whereby N dimensions are mapped to M dimensions and M<N. The decoder 1211 uncompresses the data, mapping the M dimensions to N dimensions using a multi-layered convolutional neural network. The translator may be used to populate a relatively dense database of translated images 1215.
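
For illustration, the sketch below shows one way the sparse real training images and the dense translated images could be combined into a single training stream for the object recognizer, assuming PyTorch; the tensors, label counts, and toy classifier are placeholders rather than the actual data or recognizer architecture.

    # Sketch of training a recognizer on real images plus translated images.
    import torch
    from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

    # Stand-ins for the two labeled image sets (images, class labels).
    real_images = TensorDataset(torch.rand(100, 3, 64, 64), torch.randint(0, 3, (100,)))
    translated_images = TensorDataset(torch.rand(5000, 3, 64, 64), torch.randint(0, 3, (5000,)))

    combined = ConcatDataset([real_images, translated_images])   # sparse + dense sets together
    loader = DataLoader(combined, batch_size=32, shuffle=True)

    recognizer = torch.nn.Sequential(            # toy classifier standing in for recognizer 1219
        torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 3)
    )
    optimizer = torch.optim.Adam(recognizer.parameters(), lr=1e-3)
    for images, labels in loader:
        loss = torch.nn.functional.cross_entropy(recognizer(images), labels)
        optimizer.zero_grad(); loss.backward(); optimizer.step()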

FIG. 13 is a block diagram of an embodiment illustrating the process of training an object recognizer using generated images in addition to training images. During training of the object recognizer 1319, a relatively sparse set of training images 1313 and a relatively dense set of generated images 1317 are used. The generated images 1317 are produced by the DMTG 1303, and specifically the generator 1305, after it has been trained according to methods described herein. The generator may contain a generative adversarial network (GAN) 1307, which may contain a discriminative network 1309 and a generative network 1311. The generator 1311 may be used to populate a relatively dense database of generated images 1317.

FIG. 14 is a block diagram of an embodiment illustrating the process of training an object recognizer using translated images and generated images in addition to training images, according to an embodiment of the invention. During training of the object recognizer 1429, a relatively sparse set of training images 1421, a relatively dense set of translated images 1425, and a relatively dense set of generated images 1423 are used. The translated 1425 and generated 1423 images are produced by DMTG 1403 after its translator 1405 and generator 1415 have been trained according to methods described herein. The translator 1405 may produce translated images 1425, and the generator 1415 may produce generated images 1423. The translator 1405 may contain an autoencoder 1407, which may contain an encoder 1409 and a decoder 1411. The translator may be used to populate a relatively dense database of translated images 1425. The generator 1415 may contain a generative adversarial network (GAN) 1417, which may contain a discriminative network 1419 and a generative network 1421. The generator network may use feedback from the discriminative network to learn how to generate new images that are better than the last iteration of generated images. The generator 1415 may be used to populate a relatively dense database of generated images 1423.

FIG. 15 illustrates an example result of a translation process 1503 from a synthetic image of a cat 1501 to a photo realistic image of a cat 1505, according to an embodiment of the invention. Translation process 1503 may be performed by a translator such as translator 609 (see FIG. 6).

FIG. 16 illustrates an example translation process where a set of images 1601 on one manifold are translated by translation process 1603 into a set of images 1605 that exist on a different manifold, according to an embodiment of the invention. In this case, a set of cats with frontal views 1601 are translated 1603 into images of cats with open mouths 1605. Again, translation process 1603 may be performed by a translator such as translator 609 (see FIG. 6).

FIG. 17 illustrates linear separable objects in a space, where the objects have been grouped into three different classes 1701, 1703, and 1705. Class 1703 comprises rectangles. Class 1701 comprises ellipses. Class 1705 comprises triangles. Because classes 1701, 1703, and 1705 in FIG. 17 can be separated from each other by straight lines, these classes are linear separable.

FIG. 18 illustrates linear inseparable objects in a space, where the objects have been grouped into three different classes 1801, 1803, and 1805. Class 1801 comprises rectangles. Class 1805 comprises ellipses. Class 1803 comprises triangles. Because classes 1801, 1803, and 1805 in FIG. 18 cannot be separated from each other by straight lines, these classes are not linear separable.

FIG. 19 illustrates a translation process where one domain of images 1901 is translated into another domain of images 1917, according to an embodiment of the invention. Assuming an already trained translator exists, such as the translator 609 in FIG. 6, and assuming a set of training images 601 and a set of synthetic images 603, with image pairs 605 indicating which synthetic images 603 align with which training images 601, the pairings 605 are the input into the translator 609 within DMTG 607. The translator 609 consists of an autoencoder 611, which consists of an encoder 613 and decoder 615. The encoder 613 compresses the data, whereby N dimensions are mapped to M dimensions and M<N. The decoder 615 uncompresses the data, mapping the M dimensions to N dimensions using a multi-layered convolutional neural network. Each time an image from 603 goes through the encoder 613 and decoder 615, the network within autoencoder 611 learns how to take the compressed information based on synthetic images 603 to generate a corresponding image from 601 based on pairing 605. Now referring to FIG. 19, in domain 1901 there are three different classes of objects: 1903, 1905, and 1907. Class 1903 comprises rectangles. Class 1905 comprises triangles. Class 1907 comprises ellipses. In domain 1917 there are three different classes of objects: 1919, 1921, and 1923. Class 1919 comprises rectangles. Class 1923 comprises triangles. Class 1921 comprises ellipses. Using a trained translator such as translator 609, image 1909 of class 1905 is encoded into a hidden representation h 1913, then decoded into an image 1915 that is similar to the images in class 1923. As image 1909 goes through the translation process 1911, the image 1909 is encoded and reduced to a lower dimension by the encoder 613, which is represented by a hidden representation 1913, and then the image is decoded by the decoder 615 to produce the image 1915. For each image in 1901, a translation 1911 will proceed that will convert the images from domain 1901 to images that are similar to images in domain 1917. For example, images from class 1907 (ellipses) will be converted to images similar to 1921 (ellipses). Images from 1903 (rectangles) will be converted to images similar to 1919 (rectangles). Images from 1905 (triangles) will be converted to images similar to 1923 (triangles).

FIG. 20 illustrates the translation process where one domain of images 2001 is translated into another domain of images 2033, according to an embodiment of the invention. In domain 2001 there are three different classes of objects: 2003, 2005, and 2007. Class 2003 comprises rectangles. Class 2007 comprises triangles. Class 2005 comprises ellipses. In domain 2033 there are also three different classes of objects: 2035, 2037, and 2039. Class 2035 comprises rectangles. Class 2039 comprises triangles. Class 2037 comprises ellipses. Assuming an already trained translator, such as translator 609 as described in FIG. 6, a particular image 2017 of class 2007 is translated into an image 2023 that is similar to the images in class 2039. As image 2017 goes through the translation process 2021, the image is encoded by an encoder, such as encoder 613, and reduced to a lower dimension which is represented by a hidden representation 2019, and then decoded by a decoder, such as decoder 615, to produce the image 2023. In this diagram, additional transformations can be applied globally 2041 for all images in domain 2001 before going through the translation process, represented by four different global transformations 2009, 2011, 2013, and 2015. In addition, global transformations 2029, in this diagram four different global transformations 2025, 2027, 2029, and 2031, are applied to all images after the translation process. These two types of global transformations—either occurring before translation 2041, where the global information becomes part of what is learned by the network, or after 2029, where the network does not learn the global information—are used in addition to the translation process to allow for different transformations. For each image in 2001, a translation 2021 will proceed that will convert the images from domain 2001 to images similar to images in domain 2033. For example, images from class 2003 will be converted to images similar to class 2035. Images from class 2005 will be converted to images similar to class 2037. Images from class 2007 will be converted to images similar to class 2039.

FIG. 21 illustrates the image generation process using a Generative Adversarial Network (GAN), according to an embodiment of the invention. In this figure, a single GAN 2101 may contain a discriminative network 2103 and a generative network 2105 to generate images for three different classes of images: 2113, 2117, and 2125. Class 2113 is represented as rectangles. Class 2117 is represented as triangles. Class 2125 is represented as ellipses. The GAN 2101 may be used to generate images of each class. For example, three images, 2107, 2109, and 2111, are created for class 2113. One image, 2115, is created for class 2117. Three images, 2119, 2121, and 2123, are created for class 2125. As is the case for class 2113 and class 2125, the GAN 2101 is able to fill in the sparsity for these two under-represented classes. The results of image generation using the GAN 2101 are shown by classes with additional images 2139. Class 2125 becomes class 2145. Class 2113 becomes class 2141. Class 2117 becomes class 2143. Images in classes 2141, 2143, and 2145 are all stored in the image generation database 2147.

FIG. 22 is a block diagram illustrating the image generation process using multiple Generative Adversarial Networks (GANs), according to an embodiment of the invention. In this diagram, a different GAN may be used to generate images for each class. A class 2213 represented as rectangles is assigned a GAN 2201, which may contain a Discriminative network 2203 and a Generative network 2205. The GAN 2201 generates new images 2207, 2209, and 2211, which are similar to corresponding images in class 2213. A class 2221 represented as triangles is assigned a GAN 2215, which may contain a Discriminative network 2217 and a Generative network 2219. The GAN 2215 generates a new image 2223, which is similar to corresponding images in class 2221. A class 2231 represented as ellipses is assigned a GAN 2225, which may contain a Discriminative network 2227 and a Generative network 2229. The GAN 2225 generates new images 2233, 2235, 2237, and 2239, which are similar to corresponding images in class 2231. The results of image generation using the GANs 2201, 2215, and 2225 are shown by classes with additional images 2241. Class 2213 becomes class 2243. Class 2221 becomes class 2245. Class 2231 becomes class 2247. Images in classes 2243, 2245, and 2247 are stored in the image generation database 2249.
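
A minimal sketch of this per-class arrangement is shown below, assuming PyTorch: a separate toy GAN is trained for each class and then sampled to densify that class. Class names, image sizes, and iteration counts are illustrative assumptions only.

    # Sketch of training one GAN per class and sampling each to fill in sparsity.
    import torch
    import torch.nn as nn

    def make_gan():
        G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 64 * 64), nn.Tanh())
        D = nn.Sequential(nn.Linear(64 * 64, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
        return G, D

    # Stand-in per-class training images, keyed by a hypothetical class label.
    class_images = {c: torch.rand(20, 64 * 64) for c in ("rectangle", "triangle", "ellipse")}
    bce = nn.BCELoss()
    generated = {}

    for label, real in class_images.items():
        G, D = make_gan()                              # a separate GAN for this class
        opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
        opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
        n = real.size(0)
        for _ in range(100):                           # illustrative training iterations
            fake = G(torch.randn(n, 100)).detach()
            d_loss = bce(D(real), torch.ones(n, 1)) + bce(D(fake), torch.zeros(n, 1))
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            g_loss = bce(D(G(torch.randn(n, 100))), torch.ones(n, 1))
            opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        generated[label] = G(torch.randn(1000, 100)).detach()  # densify this class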

FIG. 23 is a block diagram illustrating a variation of training a translator that includes unpaired images, according to an embodiment of the invention. In FIG. 6, a translator 609 was shown with pairs 605 of images from a synthetic database 603 and a training database 601; the pairs 605 are used to train the translator 609. In FIG. 23, the set of synthetic images 2303 and the relatively sparse set of training images 2301 are not paired. The translator 2307, which is part of DMTG 2305, may contain an autoencoder 2309, which may contain an encoder 2311 and a decoder 2313. The encoder 2311 compresses the data, whereby N dimensions are mapped to M dimensions and M<N. The decoder 2313 uncompresses the data, mapping the M dimensions to N dimensions using a multi-layered convolutional neural network. Each time an image from the synthetic images 2303 goes through the encoder 2311 and decoder 2313, the network within the autoencoder 2309 learns how to take the compressed information based on the synthetic images 2303 to generate a corresponding image from the training images 2301. Because no explicit pairings are provided, the translator 2307 may use an internal distance measure 2315 to find pairs automatically, as sketched below. Translated images are stored in a relatively dense database 2317.
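
The following sketch illustrates one possible automatic pairing step for the unpaired case, assuming PyTorch and using plain pixel-space Euclidean distance; the actual distance measure 2315 is not specified here, so this choice is an assumption for illustration only.

    # Sketch of nearest-neighbor pairing between unpaired synthetic and real images.
    import torch

    synthetic = torch.rand(50, 3, 64, 64)   # stand-in for unpaired synthetic images
    real = torch.rand(30, 3, 64, 64)        # stand-in for sparse real training images

    # Pairwise distances between flattened images: shape (n_synthetic, n_real).
    dists = torch.cdist(synthetic.flatten(1), real.flatten(1))
    nearest = dists.argmin(dim=1)           # index of the closest real image per synthetic image

    pairs = [(synthetic[i], real[j]) for i, j in enumerate(nearest.tolist())]
    # 'pairs' can then play the role of the explicit pairings used in FIG. 6.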

FIG. 24 shows an even further advancement by enriching the translation process with generated images 2409. In FIG. 24, a GAN, such as GAN 2101, may be used to enrich a data set with generated images 2401. In 2401, one class 2405 of images is represented as rectangles, one class 2407 is represented as ellipses, and another class 2403 is represented as triangles. These images are stored in a generated image database 2409. The generated images 2409 are treated as training images in addition to the pre-existing, relatively sparse set of training images 2413, whereby the combination of these two data sets forms a new training data set 2429. In addition, the generated images 2409 are used in training the translator. DMTG 2415 may contain a translator 2417, which includes an autoencoder 2419 with an encoder 2421 and a decoder 2423. The encoder 2421 compresses the images in the training data set 2429, whereby N dimensions are mapped to M dimensions and M<N. The decoder 2423 uncompresses the data, mapping the M dimensions back to N dimensions using a multi-layered convolutional neural network. The translator 2417 also has the ability to internally pair synthetic images 2411 and training images 2413 automatically, using a distance-based measure 2425 that discovers similar images. The output of the translator is a relatively dense set of translated images 2427.

FIG. 25 shows the effects of image generation on a data set 2501 having three classes of images. One class of images is represented as rectangles 2503. A second class of images is represented as ellipses 2505. A third class of images is represented as triangles 2507. Given a sparse class 2503, an image generator can be used to generate additional images similar to images that exist in that class. In FIG. 25, six images are ‘generated’ images (2509, 2511, 2513, 2515, 2517, 2519).

Sparse Data Sets Versus Dense Data Sets

Typically, a set of training images for a given object (or an object class), such as training images 101, 201, 301, 411, 511, 601, 701, 801, 901, 1001, 1102, 2301, and 2413, will comprise a relatively sparse set of real-world images obtained from actual photographs, as well as from other means of obtaining images of objects (for example, RADAR). As mentioned above, embodiments of the present invention relate to systems and methods for improving the training of machine learning systems to recognize certain objects within a given image, where the number of real-world training images is relatively sparse, but the number of synthetically derived training images is comparatively dense. In the context of the present invention, it is difficult to describe the precise boundaries of a “sparse” data set versus those of a “dense” data set. It is difficult because the ultimate goal of training a machine learning system to recognize a given object is not entirely dependent on the quantity of training data. The quality of training data is important as well, as is the variety of different perspectives and viewpoints of a given object represented in a training data set. For purposes of the present invention, however, a “sparse” data set is one that does not contain enough high-quality variations of an object to train a machine learning system to effectively recognize the object with acceptable reliability and/or confidence. A “dense” data set, on the other hand, is simply a data set that is not sparse. That is, a dense data set will contain enough high-quality variations of an object to train a machine learning system to effectively recognize the object with acceptable reliability and/or confidence. Embodiments of the present invention can transform sparse data sets into dense data sets by teaching machine learning systems to create additional high-quality training data using sparse data sets as a starting point, without the need to acquire more real-world images, which can often be a costly task that may also subject the image-gathering systems and/or personnel to unacceptable risk. Recognizing the problems associated with defining a precise set of quantities or limits that would characterize a sparse data set versus a dense data set, and putting aside the issue of image quality and other factors, for instructional purposes the number of images in a sparse data set may often be in the range from tens of images to hundreds of images. The number of images in a dense data set, on the other hand, may range from one or two thousand images to many thousands of images.

Variations

As depicted in FIG. 5, DMTG 501 may contain a generator 503 which is composed of a GAN 505 which has a discriminative neural network 507 and a generative neural network 509. The images 513 generated by the generator 503 can be used in conjunction with the training images 511 to train an object recognizer 515. This requires fewer images from the training image data set 511 while still achieving high accuracy rates 519 when measuring accuracy as a function of the number of original training examples 517. Again, this method is particularly suitable for methods which require very large amounts of data, such as deep learning neural networks (a subset of machine learning).

Given an image that exists in a data set, if that image were transformed, either through rotation, lighting changes, or other sorts of distortion, it would still be similar to the set of images. Deep generative models take advantage of this and can be used to generate new images based on training a network from an existing data set. Deep model image generation may be used to supplement a sparse data set by generating additional images to ‘fill-in’ for missing data. This method can be guided by class labels but also works as an unsupervised approach.
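
The intuition of the preceding paragraph can be sketched, purely for illustration, with two simple transformations of an existing image; the transformation choices and the random stand-in image below are assumptions and use only NumPy.

    # Minimal sketch of the intuition: simple transformations of an existing image
    # (rotation, brightness changes) still resemble the original data set.
    import numpy as np

    def rotate90(image, k=1):
        # Rotate the image by k * 90 degrees.
        return np.rot90(image, k)

    def adjust_brightness(image, factor):
        # Scale pixel intensities to simulate a lighting change.
        return np.clip(image * factor, 0.0, 1.0)

    image = np.random.rand(64, 64)  # stand-in for an existing training image
    variants = [rotate90(image, 1), rotate90(image, 2), adjust_brightness(image, 0.8)]

Deep generative models go well beyond such fixed transformations, but the principle is the same: new samples are drawn that remain consistent with the existing data set.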

As depicted in FIG. 7, DMTG 707 consists of a generator 709 which may contain a generative adversarial network (GAN) 711, which may contain two networks, a discriminative network 713 and a generative network 715. The generative network 715 is typically initialized with Gaussian noise 705 defined in a manner similar to equation 703. The discriminative network 713 is trained based on the set of training images. As the two networks play the adversarial game, over time they improve, and eventually the generative network 715 is able to generate images 717 that are statistically similar to the training images 701.
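
The adversarial game described above may be realized, as a non-limiting sketch, by the following training step (Python with PyTorch). The small fully connected networks, the flattened 64x64 image representation, the learning rates, and the loss choice are illustrative assumptions rather than the disclosed architecture.

    # Illustrative adversarial training loop with a Gaussian noise input.
    import torch
    import torch.nn as nn

    img_dim, noise_dim = 64 * 64, 100

    generator = nn.Sequential(nn.Linear(noise_dim, 256), nn.ReLU(),
                              nn.Linear(256, img_dim), nn.Tanh())
    discriminator = nn.Sequential(nn.Linear(img_dim, 256), nn.LeakyReLU(0.2),
                                  nn.Linear(256, 1), nn.Sigmoid())

    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    def adversarial_step(real_batch):
        batch = real_batch.size(0)
        # Discriminator step: real images labeled 1, generated images labeled 0.
        noise = torch.randn(batch, noise_dim)          # Gaussian noise input
        fake = generator(noise).detach()
        d_loss = bce(discriminator(real_batch), torch.ones(batch, 1)) + \
                 bce(discriminator(fake), torch.zeros(batch, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        # Generator step: try to make the discriminator label generated images as real.
        noise = torch.randn(batch, noise_dim)
        g_loss = bce(discriminator(generator(noise)), torch.ones(batch, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
        return d_loss.item(), g_loss.item()

Iterating this step over batches of training images corresponds to the adversarial game in which the generative network gradually produces images statistically similar to the training images.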

A variation of the process depicted in FIG. 7 is shown in FIG. 8, where a generative adversarial network (GAN) can also be initialized with what it learned from the autoencoding method. As shown in FIG. 8, DMTG 807 consists of a generator 809 which may contain a generative adversarial network (GAN) 811, which may contain two networks, a discriminative network 813 and a generative network 815. The generative network 815 is initialized in this case with the output from a translation 803. The discriminative network 813 is trained based on the set of training images. As the two networks play the adversarial game, over time they improve, and eventually the generative network 815 is able to generate images 817 that are statistically similar to the training images 801. By initializing with the translation images 803, the generative network 815 is able to learn how to produce images similar to the training data 801 faster.
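
One way such an initialization could be realized, sketched below under the assumption that the generative network shares the trained decoder's architecture (an assumption made purely for illustration), is to copy the decoder's learned parameters into the generator before adversarial training begins.

    # Sketch of warm-starting a generator from a previously trained translator decoder.
    # Assumes the generator and the decoder share the same architecture.
    import torch

    def warm_start_generator(generator, trained_decoder):
        # Copy the decoder's learned parameters into the generator so that
        # adversarial training begins from the translation solution rather
        # than from a random initialization.
        generator.load_state_dict(trained_decoder.state_dict())
        return generator

When the architectures differ, an alternative is to briefly pre-train the generator to reproduce the translated images before the adversarial game begins; either route gives the generative network a head start over random initialization.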

Another variation of the process depicted in FIG. 7 is shown in FIG. 9, where a generative adversarial network (GAN) can also be initialized with synthetic images. As shown in FIG. 9, DMTG 907 consists of a generator 909 which may contain a generative adversarial network (GAN) 911, which may contain two networks, a discriminative network 913 and a generative network 915. The generative network 915 is initialized in this case with the synthetic images 903 (note that synthetic images 603, as depicted in FIG. 6, are used with training images 601 to train the translator 609). The discriminative network 913 is trained based on the set of training images. As the two networks play the adversarial game, over time they improve, and eventually the generative network 915 is able to generate images 917 that are statistically similar to the training images 901. By initializing with the synthetic images 903, the generative network 915 is able to learn how to produce images similar to the training data 901 faster.

To improve object recognition with fewer training examples, as depicted in FIG. 13, the trained DMTG 1303 Generator 1305 can be used to generate images. The Generator 1305 may contain a GAN 1307 which consists of two networks, a discriminative network 1309 and a generative network 1311. After the Generator 1305 has been trained, it can then be called upon to generate N samples given an input image, resulting in a set of generated images 1317. The Object Recognizer 1319 can be trained using the generated images 1317 in conjunction with the training images 1313.
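
A minimal sketch of this step is shown below; the helper names, the noise dimension, and the use of a noise-driven generator are assumptions for illustration, and the resulting tensors would then be fed to whatever recognizer training routine is in use.

    # Sketch of generating N samples from a trained generator and assembling a
    # combined training set of real and generated images for one class label.
    import torch

    def generate_samples(generator, n, noise_dim=100):
        with torch.no_grad():
            return generator(torch.randn(n, noise_dim))

    def build_training_set(real_images, real_labels, generator, label, n_generated):
        generated = generate_samples(generator, n_generated)
        images = torch.cat([real_images, generated], dim=0)
        labels = torch.cat([real_labels, torch.full((n_generated,), label)], dim=0)
        return images, labels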

As mentioned, embodiments of the DMTG can be used to ‘fill-in’ sparse data sets. In FIG. 21, the act of ‘filling-in’ images for sparse classes is shown. Assuming a single GAN 2101, which may contain a discriminative network 2103 and a generative network 2015, the GAN can generate images for three different classes of images 2113, 2117, and 2125. For example, the class with rectangles 2113 has only three samples in this example. The GAN 2101 can be called upon to generate more examples such as 2107, 2109, and 2111. The new data set 2139 that includes both the original images and the generated images forms a new training data set 2147.

A variation of FIG. 21 assumes that a GAN is created for each class type, as depicted in FIG. 22. A class 2213 represented as rectangles is assigned a GAN 2201 which may contain a discriminative network 2203 and a generative network 2205. A class 2221 represented as triangles is assigned a GAN 2215 which may contain a discriminative network 2217 and a generative network 2219. A class 2231 represented as ellipses is assigned a GAN 2225 which may contain a discriminative network 2227 and a generative network 2229. The GAN 2201 can be called upon to generate more examples such as 2207, 2209, and 2211. The new data set 2241 that includes both the original images and the generated images forms a new training data set 2249.
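
The per-class arrangement can be sketched as follows; make_gan(), train_gan(), and generate() are hypothetical helpers standing in for the adversarial training shown earlier, and the target count is an illustrative parameter.

    # Sketch of maintaining one GAN per class and generating enough images to
    # bring every class up to a target count.
    def fill_in_classes(images_by_class, target_count, make_gan, train_gan, generate):
        filled = {}
        for label, images in images_by_class.items():
            gan = make_gan()
            train_gan(gan, images)                      # train on this class only
            deficit = max(0, target_count - len(images))
            filled[label] = list(images) + list(generate(gan, deficit))
        return filled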

In addition to the aforementioned benefits, deep image translation and deep image generation can be used together to improve object recognition. In FIG. 14, the output from the Translator 1405 in the form of translated images 1427, in addition to the output from the Generator 1415 in the form of generated images 1425, can be used along with the training images 1423 to train the Object Recognizer 1429.

Furthermore, a feedback mechanism between the Deep Model Image Generation and Deep Model Image Translation processes is depicted in FIG. 24. In FIG. 24, it is assumed that a generated data set 2409 has been produced. The generated data set 2409 can be combined with the training images 2413 and used with the synthetic images 2411 to train the Translator 2417. The translator 2417, using its internal distance measures 2425, would apply automatic pairing in order to use the autoencoder 2419 to encode the synthetic images 2411 to be similar to the training images 2429 that were formed by combining the generated images 2409 with the training images 2413.
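
A very small sketch of this feedback round is given below; retrain_translator() is a hypothetical helper wrapping the autoencoder training loop illustrated earlier, and the simple list concatenation stands in for whatever data set merging an embodiment actually uses.

    # Sketch of the feedback step: generated images are merged with the original
    # training images, and the translator is retrained against the enlarged set.
    def feedback_round(generated_images, training_images, synthetic_images,
                       retrain_translator):
        enlarged_targets = list(training_images) + list(generated_images)
        return retrain_translator(synthetic_images, enlarged_targets)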

When an autoencoder learns an encoding from one image domain to another, that encoding can then be applied to other images. This method tends to preserve spatial relationships between the other data points in the data set, which often has benefits in training (scale, translation, and rotational invariance are known to improve the performance of many machine learning methods in related computer vision tasks). In particular, when performing this type of embedding using deep architectures, manifold shapes across domains are preserved. In FIG. 16, for example, the frontal cat face “manifold” is “translated” to the open-mouth cat face “manifold,” where certain properties are preserved across the manifold, such as the darkness and spatial arrangements of some markings on the cats. This concept tends to hold in language-based translations, as shown in Natural Language Processing machine translation research [see endnote 2]. This method enables class-to-class translations as well as cross-domain translations. For instance, it is possible that translations of black cats to black cats have special properties that do not hold for translations of tabby cats to tabby cats.

A key component of this work is pairing images between domains, where image pairing may be established a priori by domain experts. As depicted in FIG. 23, a translator 2307 also has the ability to perform automatic pairing by using a set of distance-based measures 2315. Given a set of images from domain one 2303 and a set of images from domain two 2301, whereby domain one represents synthetic images 2303 and domain two represents the original training images 2301, DMTG 2305, and in particular the translator 2307, which may contain an autoencoder 2309, is able to approximate pairs using two different techniques: mean squared error and structural similarity measures. Using a function that combines these measures, the translator 2307 within DMTG 2305 is able to find approximate pairs between domain one 2303 and domain two 2301.

When using a distance-based metric, careful attention should be paid to ensure that, at each iteration of the training process, a random match is chosen for pairing. This forces the autoencoder to learn more variations for a particular input.
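
The combined distance and the per-iteration random match can be sketched as follows; the equal weighting of the two measures, the top-k candidate pool, and the scikit-image structural_similarity call (for grayscale float images) are illustrative assumptions.

    # Sketch of approximate pairing using mean squared error combined with
    # structural similarity, with a random pick among the closest candidates.
    import random
    import numpy as np
    from skimage.metrics import structural_similarity

    def pair_distance(a, b, alpha=0.5):
        mse = np.mean((a - b) ** 2)
        ssim = structural_similarity(a, b, data_range=1.0)
        # Lower is better: combine MSE with (1 - SSIM).
        return alpha * mse + (1.0 - alpha) * (1.0 - ssim)

    def choose_pair(synthetic_image, training_images, k=5):
        # Rank candidate real images by the combined distance, then pick one of
        # the k closest at random so repeated iterations see different matches.
        ranked = sorted(training_images,
                        key=lambda t: pair_distance(synthetic_image, t))
        return random.choice(ranked[:min(k, len(ranked))])

Selecting randomly among the closest matches, rather than always the single closest, is one simple way to expose the autoencoder to more variation for a given input.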

A further modification to the Deep Model Image Translation embodiment, as shown in FIG. 10, relates to the loss function, which controls how the autoencoder 1011 improves its learning process over time. Normally, at each iteration a loss function may be used to calculate how well the network performed on that iteration of training. The results of this function dictate how the network will change on the next iteration. Normally, the loss is calculated internally by the translator 1009. In this variation, however, the activations 1021 that are retrieved from the object recognizer 1017, which is also a deep neural network, are used inside the loss function of the autoencoder 1011. This coordination between the object recognizer 1017 and the translator 1009 synchronizes how well the object recognizer performs with how the autoencoder changes its translation process.
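
One possible form of such a loss, sketched under the assumptions that the recognizer's weights are frozen and that feature_extractor is a hypothetical module returning an intermediate activation of the recognizer, combines a pixel reconstruction term with an activation-matching term; the 0.1 weighting is likewise an assumption.

    # Sketch of an autoencoder loss that incorporates activations retrieved from
    # the object recognizer.
    import torch
    import torch.nn.functional as F

    def translation_loss(translated, target, feature_extractor, weight=0.1):
        # Standard reconstruction term.
        pixel_loss = F.mse_loss(translated, target)
        # Activation term: the translated image should excite the recognizer in
        # the same way the real target image does.
        with torch.no_grad():
            target_features = feature_extractor(target)
        activation_loss = F.mse_loss(feature_extractor(translated), target_features)
        return pixel_loss + weight * activation_loss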

Benefits of the Embodiments

The embodiments described benefit the object recognition task by reducing the requirement to acquire large data sets in order to train an object recognition algorithm. This is of particular importance to deep learning object recognition algorithms, as deep learning neural networks typically require large data sets to sufficiently train the network. Because data acquisition can be costly, the embodiments described in this invention can significantly reduce those costs.

Specifically, the Deep Model Image Translation embodiment describes how to translate from one set of images to another set of images. This embodiment enables one to use existing data collections and to adapt images in those collections into the images required to train the object recognizer. In addition, this method can be applied to adapt classes within a data set to other classes in that data set, addressing the issues of unbalanced data sets, poor quality images, and sparsity.

The embodiment described as Deep Model Image Generation describes a method by which a significantly smaller data set can be used to train an object recognizer. The Deep Model Image Generation embodiment can generate tens of thousands of images for each image in a given data set. This can result in significant cost savings in both data acquisition and data cleaning efforts to prepare new data collections for training an object recognizer.

This embodiment also addresses the issues of sparsity and non-existent classes by filling in data sets where there are otherwise very few images. Using the existing data collections, this embodiment describes a method for generating new images that are closely related to the images contained in the existing data set.

Additional benefits include using the embodiments together to overcome common issues among machine learning data sets, such as sparsity, noise, obfuscation of objects, poor quality images, and non-existent images.

In addition, it is described how the embodiments can work with each other to further improve the methods put forth. First, the generation process can be initialized with output from the translation process to speed up image generation. Second, to improve the translation process, successful rounds of image generation can be used with training data sets as input into the translation training process.

Computing Device

FIG. 26 is a block diagram of an exemplary embodiment of a Computing Device 2600 in accordance with the present invention, which in certain operative embodiments can comprise, for example, Object Recognizer (103, 203, 303, 415, 515, 1017, 1105, 1219, 1319, 1429), DMTG (401, 501, 607, 707, 807, 907, 1007, 1203, 1303, 1403, 2305, 2415), Autoencoder (405, 611, 1011, 1207, 1407, 2309, 2419), Translator (403, 609, 1009, 1205, 1405, 2307, 2417), Generator (503, 709, 809, 909, 1305, 1415), or GAN (505, 711, 811, 911, 1307, 1417, 2101, 2201, 2215, 2225). Computing Device 2600 can comprise any of numerous components, such as, for example, one or more Network Interfaces 2610, one or more Memories 2620, one or more Processors 2630, program Instructions and Logic 2640, one or more Input/Output (“I/O”) Devices 2650, and one or more User Interfaces 2660 that may be coupled to the I/O Device(s) 2650, etc.

Computing Device 2600 may comprise any device known in the art that is capable of processing data and/or information, such as any general purpose and/or special purpose computer, including a personal computer, workstation, server, minicomputer, mainframe, supercomputer, computer terminal, laptop, tablet computer (such as an iPad), wearable computer, mobile terminal, Bluetooth device, communicator, smart phone (such as an iPhone, Android device, or BlackBerry), a programmed microprocessor or microcontroller and/or peripheral integrated circuit elements, a high speed graphics processing unit, an ASIC or other integrated circuit, a hardware electronic logic circuit such as a discrete element circuit, and/or a programmable logic device such as a PLD, PLA, FPGA, or PAL, or the like, etc. In general, any device on which a finite state machine resides that is capable of implementing at least a portion of the methods, structures, API, and/or interfaces described herein may comprise Computing Device 2600.

Memory 2620 can be any type of apparatus known in the art that is capable of storing analog or digital information, such as instructions and/or data. Examples include a non-volatile memory, volatile memory, Random Access Memory, RAM, Read Only Memory, ROM, flash memory, magnetic media, hard disk, solid state drive, floppy disk, magnetic tape, optical media, optical disk, compact disk, CD, digital versatile disk, DVD, and/or RAID array, etc. The memory device can be coupled to a processor and/or can store instructions adapted to be executed by the processor, such as according to an embodiment disclosed herein. In certain embodiments, Memory 2620 may be augmented with an additional memory module, such as the HiTech Global Hybrid Memory Cube.

Input/Output (I/O) Device 2650 may comprise any sensory-oriented input and/or output device known in the art, such as an audio, visual, haptic, olfactory, and/or taste-oriented device, including, for example, a monitor, display, projector, overhead display, keyboard, keypad, mouse, trackball, joystick, gamepad, wheel, touchpad, touch panel, pointing device, microphone, speaker, video camera, camera, scanner, printer, vibrator, tactile simulator, and/or tactile pad, optionally including a communications port for communication with other components in Computing Device 2600.

Instructions and Logic 2640 may comprise directions adapted to cause a machine, such as Computing Device 2600, to perform one or more particular activities, operations, or functions. The directions, which can sometimes comprise an entity called a “kernel”, “operating system”, “program”, “application”, “utility”, “subroutine”, “script”, “macro”, “file”, “project”, “module”, “library”, “class”, “object”, or “Application Programming Interface,” etc., can be embodied as machine code, source code, object code, compiled code, assembled code, interpretable code, and/or executable code, etc., in hardware, firmware, and/or software. Instructions and Logic 2640 may reside in Processor 2630 and/or Memory 2620.

Network Interface 2610 may comprise any device, system, or subsystem capable of coupling an information device to a network. For example, Network Interface 2610 can comprise a telephone, cellular phone, cellular modem, telephone data modem, fax modem, wireless transceiver, Ethernet circuit, cable modem, digital subscriber line interface, bridge, hub, router, or other similar device.

Processor 2630 may comprise a device and/or set of machine-readable instructions for performing one or more predetermined tasks. A processor can comprise any one or a combination of hardware, firmware, and/or software. A processor can utilize mechanical, pneumatic, hydraulic, electrical, magnetic, optical, informational, chemical, and/or biological principles, signals, and/or inputs to perform the task(s). In certain embodiments, a processor can act upon information by manipulating, analyzing, modifying, converting, and/or transmitting the information for use by an executable procedure and/or an information device, and/or by routing the information to an output device. Processor 2630 can function as a central processing unit, local controller, remote controller, parallel controller, and/or distributed controller, etc.

Processor 2630 can comprise a general-purpose computing device, including a microcontroller and/or a microprocessor. In certain embodiments, the processor can be a dedicated-purpose device, such as an Application Specific Integrated Circuit (ASIC), a high-speed Graphics Processing Unit (GPU), or a Field Programmable Gate Array (FPGA) that has been designed to implement in its hardware and/or firmware at least a part of an embodiment disclosed herein. In certain embodiments, Processor 2630 can be a Tegra X1 processor from NVIDIA. In other embodiments, Processor 2630 can be a Jetson TX1 processor from NVIDIA, optionally operating with a ConnectTech Astro Carrier and Breakout board, or a competing consumer product (such as a Rudi (PN ESG503) or Rosie (PN ESG501) or similar device). In another embodiment, Processor 2630 can be the Xilinx proFPGA Zynq 7000 XC7Z100 FPGA Module. In yet another embodiment, Processor 2630 can be a HiTech Global Kintex Ultrascale-115. In still another embodiment, Processor 2630 can be a standard PC that may or may not include a GPU to execute an optimized deep embedding architecture.

User Interface 2660 may comprise any device and/or means for rendering information to a user and/or requesting information from the user. User Interface 2660 may include, for example, at least one of textual, graphical, audio, video, animation, and/or haptic elements. A textual element can be provided, for example, by a printer, monitor, display, projector, etc. A graphical element can be provided, for example, via a monitor, display, projector, and/or visual indication device, such as a light, flag, beacon, etc. An audio element can be provided, for example, via a speaker, microphone, and/or other sound generating and/or receiving device. A video element or animation element can be provided, for example, via a monitor, display, projector, and/or other visual device. A haptic element can be provided, for example, via a very low frequency speaker, vibrator, tactile stimulator, tactile pad, simulator, keyboard, keypad, mouse, trackball, joystick, gamepad, wheel, touchpad, touch panel, pointing device, and/or another haptic device, etc. A user interface can include one or more textual elements such as, for example, one or more letters, numbers, symbols, etc. A user interface can include one or more graphical elements such as, for example, an image, photograph, drawing, icon, window, title bar, panel, sheet, tab, drawer, matrix, table, form, calendar, outline view, frame, dialog box, static text, text box, list, pick list, pop-up list, pull-down list, menu, tool bar, dock, check box, radio button, hyperlink, browser, button, control, palette, preview panel, color wheel, dial, slider, scroll bar, cursor, status bar, stepper, and/or progress indicator, etc. A textual and/or graphical element can be used for selecting, programming, adjusting, changing, specifying, etc. an appearance, background color, background style, border style, border thickness, foreground color, font, font style, font size, alignment, line spacing, indent, maximum data length, validation, query, cursor type, pointer type, auto-sizing, position, and/or dimension, etc. A user interface can include one or more audio elements such as, for example, a volume control, pitch control, speed control, voice selector, and/or one or more elements for controlling audio play, speed, pause, fast forward, reverse, etc. A user interface can include one or more video elements such as, for example, elements controlling video play, speed, pause, fast forward, reverse, zoom-in, zoom-out, rotate, and/or tilt, etc. A user interface can include one or more animation elements such as, for example, elements controlling animation play, pause, fast forward, reverse, zoom-in, zoom-out, rotate, tilt, color, intensity, speed, frequency, appearance, etc. A user interface can include one or more haptic elements such as, for example, elements utilizing tactile stimulus, force, pressure, vibration, motion, displacement, temperature, etc.

The present invention can be realized in hardware, software, or a combination of hardware and software. The invention can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several computer systems. Any kind of computer system or other apparatus adapted for carrying out the methods described herein is suitable. A typical combination of hardware and software can be a general-purpose computer system with a computer program that, when being loaded and executed, controls the computer system such that it carries out the methods described herein.

Although the present disclosure provides certain embodiments and applications, other embodiments apparent to those of ordinary skill in the art, including embodiments that do not provide all of the features and advantages set forth herein, are also within the scope of this disclosure.

The present invention, as already noted, can be embedded in a computer program product, such as a computer-readable storage medium or device which, when loaded into a computer system, is able to carry out the different methods described herein. “Computer program” in the present context means any expression, in any language, code, or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or indirectly after either or both of the following: a) conversion to another language, code, or notation; or b) reproduction in a different material form.

The foregoing disclosure has been set forth merely to illustrate the invention and is not intended to be limiting. It will be appreciated that modifications, variations, and additional embodiments are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention. Other logic may also be provided as part of the exemplary embodiments but is not included here so as not to obfuscate the present invention. Since modifications of the disclosed embodiments incorporating the spirit and substance of the invention may occur to persons skilled in the art, the invention should be construed to include everything within the scope of the appended claims and equivalents thereof.

REFERENCES

-   [1] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in Neural Information Processing Systems. 2014.
-   [2] Koehn, Philipp. Statistical Machine Translation. Cambridge University Press, 2009.
-   [3] Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. “Reducing the dimensionality of data with neural networks.” Science 313.5786 (2006): 504-507.
-   [4] Hinton, Geoffrey E., Simon Osindero, and Yee-Whye Teh. “A fast learning algorithm for deep belief nets.” Neural Computation 18.7 (2006): 1527-1554.
-   [5] Radford, Alec, Luke Metz, and Soumith Chintala. “Unsupervised representation learning with deep convolutional generative adversarial networks.” arXiv preprint arXiv:1511.06434 (2015).
-   [6] Pu, Yunchen, et al. “Variational autoencoder for deep learning of images, labels and captions.” Advances in Neural Information Processing Systems. 2016.

The invention claimed is:
1. A method for improving the training of a computer-based object recognizer to recognize a target object within an image selected from a set of real-world images, the method comprising: training a generative adversarial network (GAN) to produce a plurality of generated images of the target object, said GAN comprising a computer-based machine learning system; training the computer-based object recognizer to recognize the target object within a newly presented digital image by providing the object recognizer a collection of training images, said collection of training images assembled from: (1) a set of real-world images of the target object, and (2) the plurality of generated images of the target object.
2. The method of claim 1 wherein the generated images are produced by providing the GAN with Gaussian noise.
3. The method of claim 1 wherein the computer-based machine learning system has a discriminative neural network instantiated with the set of real-world images.
4. The method of claim 1 wherein the computer-based machine learning system has a discriminative neural network instantiated with the set of real-world images, where each image in the set of real-world images is labeled according to the target object.
5. The method of claim 1 wherein the computer-based machine learning system has a generative neural network instantiated with Gaussian noise.
6. The method of claim 1 wherein the computer-based machine learning system has a discriminative neural network and a generative neural network, and wherein GAN training includes instantiating the generative neural network with Gaussian noise and instantiating the discriminative neural network with the set of real-world images, where each image in the set of real-world images is labeled according to the target object.
7. The method of claim 3 further comprising: creating a set of synthetic images of the target object; wherein the GAN training includes instantiating the discriminative neural network with the set of synthetic images.
8. The method of claim 3 further comprising: obtaining a set of translated images; wherein the GAN training includes instantiating the discriminative neural network with the set of translated images.
9. The method of claim 1 where the object recognizer includes a loss function that is invoked iteratively during the object recognizer training.
10. A computer-based object recognizer for recognizing a target object within an image selected from a set of real-world images, comprising: a trainable machine learning processor; a generative adversarial network (GAN) configured to produce a plurality of generated images of the target object; a training processor programmed to supply the trainable machine learning processor with a collection of training images, assembled from: (1) a set of real-world images of the target object, and (2) the plurality of generated images of the target object.
11. The computer-based object recognizer of claim 10 further comprising a source of Gaussian noise, and wherein the GAN is configured to ingest the Gaussian noise as an input used to produce the plurality of generated images of the target object.
12. The computer-based object recognizer of claim 10 wherein the GAN includes a discriminative neural network instantiated with the set of real-world images.
13. The computer-based object recognizer of claim 10 wherein the GAN includes a discriminative neural network instantiated with the set of real-world images, where each image in the set of real-world images is labeled according to the target object.
14. The computer-based object recognizer of claim 10 wherein the GAN includes a generative neural network instantiated with Gaussian noise.
15. The computer-based object recognizer of claim 10 wherein the GAN includes a discriminative neural network and a generative neural network, and wherein the GAN is trained by instantiating the generative neural network with Gaussian noise and by instantiating the discriminative neural network with the set of real-world images, where each image in the set of real-world images is labeled according to the target object.
16. The computer-based object recognizer of claim 12 further comprising: a set of synthetic images of the target object stored in a computer-readable medium; wherein the GAN is trained by instantiating the discriminative neural network with the set of synthetic images.
17. The computer-based object recognizer of claim 12 further comprising: a set of translated images of the target object stored in a computer-readable medium; wherein the GAN is trained by instantiating the discriminative neural network with the set of translated images.
18. The computer-based object recognizer of claim 10 where the object recognizer includes a loss function that is invoked iteratively during the training of the trainable machine learning processor.